The Centers for Medicare & Medicaid Services has published several datasets that contain information regarding the procedures and services that Medicare benificiaries receive from various healthcare providers.
These datasets are divided by into three main categories, according to the type of healthcare provider:The purpose of this research is to analyze anomalous payment information in these datasets, with a primary focus on the Inpatient and the Physician and Other Supplier databases since they are the 2 largest ones.
According to the Economist, Medicare fraud adds about “$98 billion, or roughly 10%, to annual Medicare and Medicaid spending”. We hope to develop a set of models that can be run on these datasets and point out anomalous data points for the purposes of fraud detection. Ideally, these anomaly detection models can be applied to future releases of Medicare datasets as well.
The Physician and Other Supplier databases contain Medicare information on physicians and other healthcare providers, along with the procedures they offered during the year 2012.
The Physician and Other Supplier PUF contains information on utilization, payment (allowed amount and Medicare payment), and submitted charges organized by National Provider Identifier (NPI), Healthcare Common Procedure Coding System (HCPCS) code, and place of service. This PUF is based on information from CMS’s National Claims History Standard Analytic Files. The data in the Physician and Other Supplier PUF covers calendar year 2012 and contains 100% final-action physician/supplier Part B non-institutional line items for the Medicare fee-for-service population.
There are 2 databases related to Physician and Other Supplier Medicare data that we will be using. Below are detailed instructions for importing and preprocessing the databases into Oracle SQL Developer.
This is the primary database related to Physician and Other Supplier information. As a warning, this database contains over 9 million records, and is stored in a 1.7 GB .txt file. Importing this database can take up to several hours, so be sure to have a stable Internet connection.
To download this database, click on the following link, and click “Accept”. If this does not work, go to http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier.html and click on the link that says “Medicare Physician and Other Supplier PUF, CY2012, Tab Delimited format”.
Extract the contents of the .zip file. The “Medicare-Physician-and-Other-Supplier-PUF-Methodology.pdf” document contains more detailed information on this database and the methodology used to aggregate the data. Rename the extension of the “Medicare-Physician-and-Other-Supplier-PUF-CY2012.txt” to “Medicare-Physician-and-Other-Supplier-PUF-CY2012.csv”. This will allow the text file to be in a format that can be imported into your Oracle database.
Now, open up Oracle SQL Developer and connect. Go to the directory sql/ddl/ and open the “PHYSICIAN_AND_OTHER_SUPPLIER_B.sql” DDL script, and run it in Oracle. This will create a table called PHYSICIAN_AND_OTHER_SUPPLIER_B. Find that table in your connection, right click, and click on “Import Data” as seen in the screenshot below.
Find and open “Medicare-Physician-and-Other-Supplier-PUF-CY2012.csv”. The Data Import Wizard window will come up. Set Format to “delimited” and set Delimiter to “tab”. Click Next.
Click Next until you get to Step 4 of 5, and then set Match By to “Position”. Click Next once more, and then click Finish.
We will add some columns to this table later, but first we must import 1 more database.
To download this database, click on the following link, and click “Accept”. If this does not work, go to http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier.html and click on the link that says “Medicare National and State/HCPCS Aggregate table, CY2012, Microsoft Excel (.xlsx)”.
Extract the contents of the .zip file. Open the “Medicare-National-and-State-HCPCS-Aggregate-CY2012.xlsx” Excel file. Go to the third page called “data”, select the first row and delete it.
Now, it is possible that Oracle can import this Excel file as is, but Oracle SQL Developer was only able to import csv files on the computer that this research was conducted on. So, additional preprocessing is required.
First, highlight columns E-H and click Format Cells. For Category, select “Number”. Uncheck the box that says “Use 1000 Separator (,)”. Select the top row for the section that says “Negative numbers:”. Click OK. Use the screenshot below if necessary.
Now, select columns I-AC (all the columns with ‘$’ signs), and go to Format Cells. Select “Number” for Category, and uncheck the box that says “Use 1000 Separator (,)”, as shown in the screenshot below. Click OK and save the file.
Make sure that you are still on the “data” page in the Excel workbook, and click Save As, and save this file as a .csv file. There will be 2 messages that come up. Click ‘OK’ then click ‘Yes’. You can ignore the second warning message because the data does not get corrupted.
Now that we have our .csv file, we can import the data into Oracle. Go to the directory sql/ddl/ and open the “PHYSICIAN_HCPCS_AGGREGATE.sql” DDL script, and run it in Oracle. This will create the PHYSICIAN_HCPCS_AGGREGATE table. Right click on this table in Oracle and select Import, as you did in the previous section. Find and open the .csv file that you just created. When the Data Import Wizard window comes up, you can keep clicking Next, and then Finish. You don’t need to change any settings here.
After the table has finished importing, go to the directory sql/ and open the “state abbrev.sql” SQL script, and run that in Oracle as well. This will change the NPPES_PROVIDER_STATE_DESC column in PHYSICIAN_HCPCS_AGGREGATE to be state abbreviations instead of the full name of the state (i.e. ‘Texas’ becomes ‘TX’), so that it matches up with the PHYSICIAN_AND_OTHER_SUPPLIER table.
This section shows how to add the BENE_DAY_SRVC_PCT and SUBMITTED_CHRG_VS_HCPCS_AGGR columns to what will ultimately be the PHYSICIAN_AND_OTHER_SUPPLIER database. This involves some complicated SQL queries, and there is no easy way to do this except for creating and dropping two more tables.
So run the following SQL scripts in succession:Finally, we have our 2 final databases pertaining to physician data, PHYSICIAN_AND_OTHER_SUPPLIER and PHYSICIAN_HCPCS_AGGREGATE.
The following descriptions are taken from Medicare-Physician-and-Other-Supplier-PUF-Methodology.pdf (except for the last 2 columns, which were manually added).
This database consists of the following columns:The following descriptions are taken from Medicare-National-and-State-HCPCS-Aggregate-CY2012.xlsx on page 2 (Documentation).
This database consists of the following columns:STD_MEDICARE_PAYMENT_AMT - Standard deviation of the Medicare payment amount. The standard deviation indicates the amount of variation from the average Medicare payment amount that exists within state, HCPCS service, and place of service.
The first point of interest is to determine whether Medicare is consistent with how they reimburse healthcare providers. In order to do this, we take the ratio of the amount Medicare pays (AVERAGE_MEDICARE_PAYMENT_AMT) to the amount allowed by Medicare for a procedure (AVERAGE_MEDICARE_ALLOWED_AMT).
As can be seen from the above figure, there is a very strong linear correlation between the Medicare paid amount and the Medicare allowed amount for each procedure, with virtually no outliers based on both the extremely low p-value as well as simple visual inspection. This indicates that Medicare is very consistent with how they reimburse healthcare providers based on the allowed amount for a particular procedure (which is based on the HCPCS code and location).
Since Medicare is very consistent with how they reimburse providers, it is possible that a healthcare provider will purposely bill Medicare for procedures that are known to be expensive. Also, it’s possible that a healthcare provider can commit fraud by billing Medicare a much higher amount compared to other providers conducting the same procedure in the same geographic area (aggregated by state in our databases).
This section describe the workflow that was used for anomaly detection. Below is a screenshot of the entire workflow.
“NPPES_PROVIDER_STATE” != ‘XX’ Or “NPPES_PROVIDER_STATE” != ‘AA’ Or “NPPES_PROVIDER_STATE” != ‘AE’ Or “NPPES_PROVIDER_STATE” != ‘AP’ Or “NPPES_PROVIDER_STATE” != ‘AS’ Or “NPPES_PROVIDER_STATE” != ‘GU’ Or “NPPES_PROVIDER_STATE” != ‘MP’ Or “NPPES_PROVIDER_STATE” != ‘PR’ Or “NPPES_PROVIDER_STATE” != ‘VI’ Or “NPPES_PROVIDER_STATE” != ‘ZZ’
After that, there is another Filter Rows node called “HCPCS Filters”, which will filter out procedures that have were conducted by at least 100 providers and had a median Medicare Allowed Amount of $1000. Recall the linear relationship between Medicare Payment and Medicare Allowed Amount. The reason Medicare Allowed Amount is chosen for the filter is because it is extremely predictable. A healthcare provider that is potentially committing fraud is going to bill Medicare for procedures that are known to have a high Medicare Allowed Amount. This node can be re-created by copying the text from hcpcs_filter.txt in the HCPCS_Filter/ directory.
Read this section ONLY if you wish to change the parameters for filtering out certain procedures. As mentioned in the previous section, the filter is set for procedures that have a median Medicare allowed amount of at least $1000 and were conducted by at least 100 providers. Otherwise, skip to the next section to continue reading about the anomaly detection workflow.
First, go to the sql/ directory and run “hcpcs filters.sql”. This will print out a list of HCPCS Codes that match with at least 100 providers and have a median Medicare Allowed Payment Amount that is at least $1000. Basically, this query is filtering out the most expensive procedures that were still conducted by enough providers so as not to end up with false positives in the anomaly detection. You can change these parameters in the SQL script as desired. If you do change the parameters, copy and paste the output into a .txt file.
Next, go to the HCPCS_Filter/src directory and open FilterHCPCS.java. Make sure to change the input path in line 17 to be the .txt file that you just created. Set the output path in line 19 to wherever you want. Compile and run the program. Copy and paste the text in the output file from the path that you specified into the “HCPCS Filters” node in the Physician Anomaly workflow.
For the Filter Columns node, the below setting were used. This is where the importance of the two calculated columns, BENE_DAY_SRVC_PCT and SUBMITTED_CHRG_VS_HCPCS_AGGR comes into play. These are the two main attributes used by the Anomaly Detection node. The NPPES geographic information is filtered out because the state has already been accounted for in SUBMITTED_CHRG_VS_HCPCS_AGGR. The rest of the geographic information and other columns that were filtered out cause too much noise, and throw off the anomaly detection.
For the Anomaly Detection node, Case ID is set to “ID”. Also, the outlier rate should be changed to 0.01 in Advanced Settings as seen below. The original setting of 0.10 is far too high.
In the Additional Output tab of the Apply node, all the columns from the source table are added. Finally, this is exported into the table called PHYSICIAN_ANOMALY, which holds the results of this particular anomaly detection model.
Before showing the results, it is important to understand the two columns BENE_DAY_SRVC_PCT and SUBMITTED_CHRG_VS_HCPCS_AGGR. Also, keep in mind that each row in the Physician database is aggregated by provider (NPI), procedure (HCPCS code), and place of service (facility or non-facility). This will help in understanding the methodology for these two columns.
For a given type of procedure and place of service, this is the percentage of all procedures that one provider conducted. For example, suppose that we wish to look at HCPCS Code 93010 (Electrocardiogram report) in a non-facility (‘O’). Assume that in there were a total of 1000 beneficiary per day services (basically, the number of procedures). Now assume that a provider with NPI (identification number) 1003000522 conducted 100 beneficiary per day services for this particular procedure in a non-facility. Then, the BENE_DAY_SRVC_PCT for this particular record will be 100/1000 = 0.1. Anytime, the BENE_DAY_SRVC_PCT for a certain procedure is unusually high, it will be flagged as an anomaly because it means that this provider conducted an usually high amount of services for this procedure relative to other providers. This does not necessarily mean that it is a case of fraud; it could just be an unusually large practice that sees lots of patients. However, it is worth inspecting, since the provider could billing Medicare for this procedure when it is not necessary because it is known to have a high reimbursement.
For a given procedure and place of service, this is the ratio of the Medicare submitted charge from one provider versus the median submitted charge for all providers in that same state that conducted this same procedure. Basically, it is designed to tell us whether a provider is overbilling (or underbilling) for a procedure relative to other providers. For example, suppose provider 1003000522 charged Medicare $1000 for HCPCS Code 93010 (Electrocardiogram report) in a non-facility (‘O’) setting. Assume that he is based in Florida. Also, assume that for all providers in Florida that conducted this same procedure in a non-facility setting, their median submitted charge to Medicare was $100. Then, the SUBMITTED_CHRG_VS_HCPCS_AGGR for this particular record will be $1000/$100 = 10. To adjust for cost of living differences across the United States, this column was aggregated by state.
The following graphs display BENE_DAY_SRVC_PCT along the x-axis and SUBMITTED_CHRG_VS_HCPCS_AGGR along the y-axis. The anomaly predictions (‘0’ for anomaly, ‘1’ for normal) are color coded. So anything that is high up on either axis should be flagged as an outlier and merits inspection. It was important to display both of these columns on each visualization because they are both accounted for in anomaly detection.
The below graph maps out every record that was in PHYSICIAN_ANOMALY. As mentioned previously, it contains records for HCPCS Code-Place of Service combinations that matched up with more than 100 providers and had a Medicare allowed payment amount of at least $1000. Some of the NPI labels can be seen. This is from page 2 in Physicians.twb in the Tableau/ directory.
The above graph is meant just to illustrate the general distribution of results. Below are several graphs that show results for specific procedures according to the HCPCS code and place of service. Only the procedures containing significant outliers are displayed. These visualizations are from page 3 in Physicians.twb in the Tableau/ directory. The NPI and state are labeled next to some of the outlying points
There are a total of 77 different graphs, but these were the ones with the most notable outliers. The first two graphs show providers that charge an unusually high amount for a procedure, and the last four show providers that conduct an unusually high number of a certain type of procedure. At this point, this is all the information that can be gleaned from public records. We have developed the model for finding anomalous data points. However, more information would be needed to actually check why these providers show up as anomalous, such as financial documents for the individual providers. Unfortunately, that information is not available to the public at this point. So, while these data points are anomalous and merit inspection, it would be premature to say that they are definitely cases of fraud. As stated in the purpose statement, this goal of this project is to develop a model for pointing out anomalies that could then be analyzed and audited by an agency such as the CMS or the OIG (https://oig.hhs.gov/fraud/strike-force/).
These databases contain Medicare information on inpatient providers and the procedures that they offered within a fiscal year period.
The Inpatient PUF contains hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG). This PUF is based on information from CMS’s Medicare Provider Analysis and Review (MEDPAR) inpatient data. The data in the Inpatient PUF contains 100% final-action (i.e., all claim adjustments have been resolved) IPPS discharges for the Medicare fee-for-service (FFS) population. The Inpatient PUF is available for calendar years 2011 and 2012.There are a total of 3 databases related to Inpatient Medicare data:
There are 3 databases related to Physician and Other Supplier Medicare data that we will be using. Below are detailed instructions for importing and preprocessing the databases into Oracle SQL Developer.
First, go to sql/ddl/ and run the INPATIENT_2012_DATA_B.sql DDL script. To download the database, click on the following link, or go to http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Inpatient2012.html and click the link that says “Inpatient Charge Data, FY2012, Comma Separated Values (CSV) version”. Extract the contents of that zip file. Right click on your INPATIENT_2012_DATA and click Import and select the .csv file you just downloaded. On Step 4 of the Import Wizard and select “Position” on the Match By section. Click Next and then Finish.
Next, go to sql/ddl/ and run the INPATIENT_AGGREGATE.sql DDL script. Then, click on the following link, or go to http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Inpatient2012.html and click the link that says “National and State Summaries of Inpatient Charge Data, FY2012, Comma Separated Values (CSV) version”. Extract the contents. Go to Oracle, and right click on the INPATIENT_2012_NATIONAL_DATA table and select Import. Select Medicare_Charge_Inpatient_DRG100_DRG_Summary_by_DRG_FY2012.csv and keep clicking Next on the Import Wizard. After this table is finished, right click on INPATIENT_2012_STATE_DATA, select Import, and Medicare_Charge_Inpatient_DRG100_DRG_Summary_by_DRGState_FY2012.csv. Keep clicking Next on the Import Wizard and allow the table to finish Importing.
To add the calculated columns, simply run the DDL script INPATIENT_2012_DATA.sql in sql/ddl/. This will create the INPATIENT_2012_DATA table.
The Inpatient databases were the only ones from the entire Medicare data set in which there were instances of Medicare paying more than what a provider charged them. This is very unusual because Medicare will almost always reimburse providers less than what the provider charged.
First let us look at the overall distribution of Medicare Payment to Charged Amount ratios across the entire database.
We can clearly see just how anomalous it is for the ratio be greater than 1. In fact, there were 340 cases in which this occurred out of a total of 157747 (roughly 0.22%).
In order to analyze this further, the distribution of these anomalies was examined on a geographic level.
The above two figures display the geographic distribution of these particular anomalies. We can see that New York City and Boston have the most amount of procedures where Medicare overpaid.
Below is a screenshot of the workflow that was used for Medicare overpayment data:
To replicate this workflow, select INPATIENT_2012_DATA as the source node. Then add a Filter Column node, and name it Payment Ratios. Below are the settings that should be used for this node: The above node filters out everything except the Medicare Payment to Charged Amount Ratio, Total Payment to Charged Amount Ratio, and Number of Discharges.
Then, add another Filter Column node, and name it “Medicare Payment to Charged Amt Ratio”. This node will filter out everything except the Medicare to Charged Amount Ratio, as seen below:
Finally, add the Anomaly Detection, Apply, and Create Table nodes. Remeber to set Case ID to “ID” and the rate of outliers to 0.01 for the Anomaly Detection. Add all the columns from the original table to the Apply node in the Additional Output section. Name the tables as seen in the first screenshot for this workflow.
Below are the results of this workflow. All graphs show the Medicare Payment to Charged Amount Ratio on the y-axis and the Anomaly SVM Probability on the x-axis. The first two graphs map out all records in the Inpatient database. The rest of the graphs show the most anomalous hospitals, several of which had no normal data points.
The below graphs are for specific hospitals and they will use the results of the first anomaly detection model from this section, which included all the payment ratios and the number of total discharges. We want our models to be as multidimensional as possible.
There are many more hospitals that contain anomalous data points, but the above ones contained by far the most outliers. All of the results can be seen in the Inpatient.twb Tableau workbook on the page titled “Overpayment Anomalies by Provider - Ratios + Discharges Included”.
If you wish to see every single hospital that had an instance of Medicare overpayment, run the following SQL query: “select provider_name,count(*) as “Count”,provider_city,provider_state from inpatient_2012_data where MCPAYMENT_CHARGEDAMT_RATIO > 1 group by provider_name,provider_city,provider_state order by “Count” desc;“.
Now, we examine the anomalies in the Inpatient database from the provider side, where a provider is either overbilling for procedures or conducting unnecessarily expensive procedures. This is very similar to the anomaly detection model that was run on the Physician databases.
This is is the screenshot of the workflow that was used.
“PROVIDER_STATE” != ‘XX’ Or “PROVIDER_STATE” != ‘AA’ Or “PROVIDER_STATE” != ‘AE’ Or “PROVIDER_STATE” != ‘AP’ Or “PROVIDER_STATE” != ‘AS’ Or “PROVIDER_STATE” != ‘GU’ Or “PROVIDER_STATE” != ‘MP’ Or “PROVIDER_STATE” != ‘PR’ Or “PROVIDER_STATE” != ‘VI’ Or “PROVIDER_STATE” != ‘ZZ’
Set the Filter Columns settings as seen below.
Then, add the Anomaly Detection, Apply, and Create Table nodes with the same settings we’ve been using for all the previous sections.
Also, note that here there was no need to filter out the expensive procedures the way we did in the Physician table. This is because there are only 100 types of procedures included in the database, and the national average covered charge for the least expensive one is still more than $15,000. So, all these procedures are extremely expensive.
The following graphs display DISCHARGE_PCT along the x-axis and SUBMITTED_CHRG_VS_DRG_AGGR along the y-axis. The anomaly predictions (‘0’ for anomaly, ‘1’ for normal) are color coded. We are most interested in anything that is high up on either axis, which should merit inspection.
DISCHARGE_PCT is analagous to the BENE_DAY_SRVC_PCT from the Physician database. Suppose there were 1000 discharges across the nation for a particular DRG procedure “638 - DIABETES W CC”. Then suppose a provider with ID 220024 had 100 discharges for this same procedure. The DISCHARGE_PCT for this record would then be 100/1000 = 0.1.
SUBMITTED_CHRG_VS_DRG_AGGR is analagous to the SUBMITTED_CHRG_VS_HCPCS_AGGR from the Physician database. The charged amount for a procedure done by an individual provider is compared to the average charged amount for all occurrences of that same procedure in the same state as the provider.
Below is the graph for all the records in the Inpatient database. This is taken from Inpatient.twb on the page title “Submitted Charge Outliers and DRG Discharge Percentage Outliers”
Finally, here are graphs pertaining to particularly anomalous procedures. The provider ID and HRR Description (geographic region) are labeled for some of the points. This is taken from the page titled “Submitted Charge Outliers and DRG Discharge Percentage Outliers by Procedure” in Inpatient.twb.
There are a total of 100 different graphs here, but these were the ones with the most notable outliers. One of the limitations here is that for the Submitted Charge vs. DRG Aggregate calculations, the CMS aggregated by state instead of by metro area. For example, several of the outliers on the y-axis are for hospitals in Philadelphia. This may partially be attributed to the fact that Philadelphia is the most expensive city in the state of Pennsylvania. However, there are some metro areas that are very sparsely populated that have only a few discharges for a particular procedure, so it wouldn’t make sense to use that data because it’d be so vulnerable to any outlier.
Lastly, if you wish to see how many providers conducted each procedure, run the following SQL query: “select drg_definition, count(provider_id) from inpatient_2012_data group by drg_definition order by count(provider_id) desc;” The least frequent procedure (Psychoses) is still conducted by 617 different providers, so that is why the threshold for Discharge Pct being considered anomalous is so low (roughly greater than 0.5%).
As stated before, more information would be needed to actually check why these providers show up as anomalous.
We have analyzed several different kinds of anomalies here. For both the Inpatient and Physician datasets, the possibility of healthcare providers overbilling was examined, along with the possibility of a provider billing Medicare for unnecessary procedures that are known to be expensive. In addition, we looked at cases where Medicare paid out more than what they were charged.
For the Physician and Other Supplier dataset, Medicare was found to be extremely consistent with how they reimbursed physicians. They paid out 80% of the Medicare allowed amount for a given procedure. Since these payments were found to be very predictable, the next step was to analyze how the providers billed procedures. Only the relatively expensive (Medicare Allowed Amount >= $1000) and common (number of providers offering the procedure >= 100) procedures were looked at. Generally speaking, if a provider charged more than 6.5 times the median charged amount for a particular procedure in his or her state, it was almost guaranteed to be flagged as anomalous (from the graph titled “Submitted Charge Outliers and HCPCS Beneficiary/Day Percentage Outliers - ALL”). From the same graph, if a provider was responsible for more than 1.8% of all the discharges in the nation for that particular procedure, then it was also flagged as anomalous. This model was meant to account for overbilling for procedures and conducting expensive procedures with high frequency. Note that there was still some middle ground between these two parameters (i.e. Bene_Day_Srvc_Pct = 0.5% and Submitted Chrg vs HCPCS Aggr = 5 was an anomalous data point). The procedures with HCPCS codes 67040, 67041, 37229, 78608, 78816, and 95951 had the most apparent cases of these outliers.
Now, the Inpatient dataset was unique in that it was the only one that had instances of Medicare reimbursing hospitals with an amount greater than what they were charged. These cases were especially prominent in the New York City and Boston metro areas. Upon further examination, it was revealed that out of the 340 instances of reimbursement exceeding the charged amount, the top 4 hospitals were either from Boston or New York City, and those 4 hospitals alone (Lincoln Medical & Mental Health Ctr - NYC, Bronx-Lebanon - NYC, Carney - Boston, St. Elizabeth’s - Boston) accounted for 1/3 of these cases. Beyond that, it was difficult to determine why these particular hospitals had so many anomalous procedures and why they were so concentrated in the Northeast just by looking at public records.
Finally, for the Inpatient dataset, overbilling and high frequency for expensive procedures was examined in the same manner as the Physician data. If a hospital charged more than 3 times the average charged amount for a particular procedure in that same state, it was considered anomalous (from the graph titled “Submitted Charge Outliers and DRG Discharge Percentage Outliers”). From the same graph, if a provider was responsible for more than 0.5% of all the discharges in the nation for that particular procedure, then it was also flagged as anomalous. Although this seems like a low percentage, bear in mind that the least frequent procedure (Psychoses) was still conducted by 617 different hospitals. Also, there was still some middle ground between these two parameters (i.e. Discharge Pct = 0.4% and Submitted Chrg vs DRG Aggr = 2 was an anomalous data point). The procedures with DRG codes 057, 189, 251, 885, and 897 contained the most apparent anomalies.
As stated several times, these anomalies are not necessarily going to be fraud cases. The purpose of this research was to develop models that would point to particularly anomalous cases that could then be used by a Medicare fraud detection agency, such as the CMS or OIG.
Lastly, the importance of data preprocessing cannot be understated. You may notice that the actual workflows in Oracle Data Miner were relatively simple. This is because so much time was spent filtering out certain rows and columns that added unnecessary noise. Additionally, adding the calculated columns was also necessary. For example, it was not enough to send in the AVERAGE_COVERED_CHARGES and AVERAGE_MEDICARE_PAYMENTS on their own into the anomaly detection. Without taking the ratio of those 2 columns and placing it in the the MCPAYMENT_TOTALPAY_RATIO column in the Inpatient databases, the instances of overpayment would not have been flagged as anomalous. There are several other examples of this throughout this project, and that is why so much time and attention was given to properly importing and preprocessing the databases.